White Wine Quality Exploration and Analysis by Yue Pan

This report explores a dataset on white wine quality and physicochemical properties. The dataset contains approximately 5,000 observations. The objective is to find which chemical properties influence the quality of white wines.

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

This dataset consists of 13 variables with 4,898 observations. However, the first variable ‘X’ is a unique row identifier with no chemical meaning, so I will exclude it and explore only the remaining 12 variables: 11 chemical inputs and 1 output, “quality”.
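Excluding the identifier column could be done along the following lines. This is only a sketch: the toy data frame is built from a few of the first rows shown in the str() output above (the file name of the full dataset is not given here), and the object name `wine` is an assumption.

```r
# Toy data frame from the first rows of the str() output above;
# the real analysis would load the full dataset instead.
wine <- data.frame(
  X = 1:5,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Drop the row-identifier column and keep every other variable.
wine <- wine[, setdiff(names(wine), "X"), drop = FALSE]

names(wine)
```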

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The distribution of the variable “quality” appears roughly normal. Quality scores range from 3 to 9; most wines score 6 or 5, followed by 7.

##     bad average    good 
##     183    4535     180

The variable “quality.band” is a newly created variable based on the quality score. It is an ordered factor with three possible values: “bad” (quality 3-4), “average” (quality 5-7) and “good” (quality 8-9).
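One way to derive such a banded variable is with `cut()`. The helper name `make_quality_band` is hypothetical and the break points are simply chosen to reproduce the 3-4 / 5-7 / 8-9 grouping described above; it is a sketch rather than the code used for this report.

```r
# Hypothetical helper: map integer quality scores (3-9) to an ordered band.
make_quality_band <- function(quality) {
  cut(quality,
      breaks = c(2, 4, 7, 9),               # intervals (2,4], (4,7], (7,9]
      labels = c("bad", "average", "good"),
      ordered_result = TRUE)
}

scores <- 3:9
bands <- make_quality_band(scores)
table(bands)
```

Because the result is an ordered factor, comparisons such as `bands > "bad"` are meaningful.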

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The variable “fixed.acidity” seems normally distributed, although there are some outliers far above the median. Therefore, in the histogram I limited the range to (4,10) to exclude them. The histogram shows that most of the wines have a fixed acidity between 6 and 7 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution of volatile acidity also seems normal, though slightly positively skewed because of the outliers. Therefore, in the histogram I limited the range to (0,0.6) to exclude them. The histogram shows that most of the wines have a volatile acidity between 0.2 and 0.3 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

By adjusting the binwidth, we can see that, within an otherwise normal-looking distribution, there is one obvious spike in citric acid just below 0.5 (after zooming in and subsetting the dataset, I found the value is 0.49).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The original distribution of “residual sugar” is positively skewed. After plotting on a log scale, the distribution appears bimodal, with residual sugar peaking around 2 and again around 9. I wonder what this plot looks like across the different quality scores from 3 to 9. There are a few outliers, as shown in the boxplot.
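The effect of a log scale can be sketched without any plotting by binning the log10-transformed values. The vector below holds only the ten residual sugar values visible in the str() output, and the bin edges are an arbitrary choice for illustration.

```r
# First ten residual.sugar values from the str() output above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Transform to log10 and count values per bin, without drawing the plot.
log_sugar <- log10(residual.sugar)
h <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)

# One row per bin: lower edge on the log10 scale and number of wines.
data.frame(bin_start = head(h$breaks, -1), count = h$counts)
```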

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Most of the wines contain less than 0.05 g/dm^3 chlorides; however, a few wines contain more than 0.1 g/dm^3. There are quite a lot of outliers in the variable “chlorides”. In the histogram, I reduced the range to (0,0.1) to give a better visualisation. With the outliers removed, the distribution appears normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The distribution of free sulfur dioxide appears normal, peaking around 30-40 mg/dm^3; however, the maximum outlier is 289. The range in the histogram is limited to (0,100).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Similar to free sulfur dioxide, the distribution of total sulfur dioxide also appears normal, with some outliers as large as 440. Total sulfur dioxide peaks at around 120.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

The variable “bound.sulfur.dioxide” is a newly created variable, calculated as total.sulfur.dioxide minus free.sulfur.dioxide. The objective was to explore the relationship between quality and different forms of sulfur dioxide. From the plots we can see that the distribution of bound sulfur dioxide appears normal, peaking around 100, which is also the median value.
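The derived column is a simple element-wise subtraction. A minimal sketch, using the first ten rows shown in the str() output and assuming the data frame is called `wine`:

```r
# First ten rows of the two sulfur dioxide columns from the str() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Bound SO2 is whatever part of the total is not free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

summary(wine$bound.sulfur.dioxide)
```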

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The densities of the wines fall in a narrow range: more than 75% of the wines have a density below 1, and the maximum density is 1.039.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

All of the wines have a pH value between 2.72 and 3.82. The pH distribution appears normal, with the median and mean very close to each other.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

75% of the wines have a sulphates level less than 0.55, while the maximum sulphates level is 1.08. We can see that the distribution is a little bit positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The distribution of the alcohol level appears a little bit positively skewed, with most of the wines’ alcohol percentage between 9% - 11%, while the lowest is 8% and the highest is 14.2%.

Univariate Analysis

What is the structure of your dataset?

  • There are 4,898 white wines in the dataset, with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).
  • The output variable “quality” is an ordered variable ranging from 3 to 9, of which 3 is the worst and 9 is the best.

  • Other observations:
    • Most white wines have a quality score 6, which is the median quality.
    • The pH values of the wines fall in a narrow range, from 2.72 to 3.82.
    • The densities of the wines are also very close, ranging from 0.987 to 1.039. More than 75% of the wines have a density < 1.

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are alcohol level, pH and different forms of acidity. After doing some research on wine quality, I think alcohol level and the different forms of acidity probably contribute most to the wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Density, residual sugar, chlorides and different forms of sulfur dioxide are likely to contribute to the wine quality.

Did you create any new variables from existing variables in the dataset?

I created two new variables: quality.band and bound.sulfur.dioxide. The variable quality.band is an ordered factor with three possible values: bad (quality score 3-4), average (quality score 5-7) and good (quality score 8-9).

The variable bound.sulfur.dioxide is total.sulfur.dioxide minus free.sulfur.dioxide. I created it because total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide, and I wanted to see the correlation between quality and each of the free and bound forms.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I noticed that there are quite a lot of outliers in these variables, and I wonder whether they have an impact on the quality of the wines. I therefore added boxplots in addition to histograms to visualise the outliers. However, in the univariate analysis, I decided not to remove any data. In the next section on bivariate plots, when exploring the relationship between the features and wine quality, I will remove some outliers where appropriate.
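For reference, the points a boxplot conventionally draws as outliers are those beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule directly; the function name `flag_outliers` and the toy vector are illustrative assumptions, not part of the analysis above.

```r
# Hypothetical helper: flag values outside the usual boxplot fences
# (Q1 - k * IQR, Q3 + k * IQR), with k = 1.5 by default.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy fixed-acidity values: typical wines plus the dataset maximum (14.2).
fixed.acidity <- c(6.3, 6.8, 7.3, 6.5, 7.0, 14.2)
fixed.acidity[flag_outliers(fixed.acidity)]
```

Flagging rather than deleting keeps the full data available, which matches the decision not to remove any observations at this stage.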

Bivariate Plots Section

From the correlation matrix, we can see that alcohol level has the strongest correlation with quality, followed by density, compared with the other features.

The correlation matrix also suggests several stronger and weaker trends for good-quality wines. More detailed plots follow to explore the relationship between quality and these features.
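A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the ten rows visible in the str() output, so its values will differ from those of the full dataset; quality is left out because it is constant (6) in those rows and would produce NA correlations.

```r
# Ten-row sample taken from the str() output shown earlier.
wine_sample <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(wine_sample), 3)
print(cor_matrix)
```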

Correlation Coefficient between alcohol and quality:

## [1] 0.436

Correlation Coefficient between pH and quality:

## [1] 0.099

The correlation coefficient is low; however, from the box plot of pH across bad, average and good wines, we can see that better-quality wines tend to have a higher median pH.

Correlation Coefficient between volatile.acidity and quality:

## [1] -0.195

From the plots above, we can clearly see that bad-quality wines tend to have higher volatile acidity.

Correlation Coefficient between density and quality:

## [1] -0.307

With some outliers in density removed, the plots above show a clear, moderately strong negative relationship between density and quality: as density increases, quality decreases.

Correlation Coefficient between chlorides and quality:

## [1] -0.21

From the plots above, we can see that chlorides level decreases as quality score / quality band increases. As there are quite a few outliers, I limited the range of the scatterplot to (0,0.1) to give a clearer visualisation.

Correlation Coefficient between residual.sugar and quality:

## [1] -0.098

The relationship between residual sugar and quality score/band is not very consistent. However, from the boxplot we can still see that too low a level of residual sugar is not good for the quality of the wines.

Correlation Coefficient between total.sulfur.dioxide and quality:

## [1] -0.175

Correlation Coefficient between bound.sulfur.dioxide and quality:

## [1] -0.218

From the plots above, we can see that the relationship between total.sulfur.dioxide and quality is not as consistent as that between bound.sulfur.dioxide and quality. Generally speaking, however, as quality improves, total and bound sulfur dioxide decrease. To create clearer plots, I applied a log10 scale to bound and total sulfur dioxide, as shown below.

The plots above show the relationship of some features directly with quality. In the plots below, I will explore the relationships among these features themselves.

## [1] "Correlation value of alcohol & density: -0.78"
## [1] "Correlation value of alcohol & residual.sugar: -0.451"

Correlation Coefficient of alcohol vs. chlorides,free.sulfur.dioxide, bound.sulfur.dioxide, total.sulfur.dioxide:

## [1] "Correlation value of alcohol & chlorides: -0.36"
## [1] "Correlation value of alcohol & free.sulfur.dioxide: -0.25"
## [1] "Correlation value of alcohol & bound.sulfur.dioxide: -0.427"
## [1] "Correlation value of alcohol & total.sulfur.dioxide: -0.449"

The plots above show the correlation between alcohol level and other features such as density, residual.sugar, chlorides and different forms of sulfur dioxide. Alcohol and density are strongly negatively correlated (-0.78): as alcohol level decreases, density increases. The other features are also negatively correlated with alcohol, so alcohol tends to be lower where they are higher.
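Labelled correlation strings like those printed above can be produced with a small wrapper around `cor()`. The helper name `report_cor` is hypothetical, and the data are just the ten rows visible in the str() output, so the number it prints will not match the full-dataset values.

```r
# Hypothetical helper: Pearson correlation between two named columns,
# formatted in the same style as the output above.
report_cor <- function(data, x, y) {
  r <- cor(data[[x]], data[[y]])
  paste0("Correlation value of ", x, " & ", y, ": ", round(r, 3))
}

# Ten-row sample from the str() output.
wine_sample <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

msg <- report_cor(wine_sample, "alcohol", "residual.sugar")
print(msg)
```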

Correlation Coefficient of pH vs. fixed.acidity, citric.acid, volatile.acidity:

## [1] "Correlation value of pH & fixed.acidity: -0.426"
## [1] "Correlation value of pH & citric.acid: -0.164"
## [1] "Correlation value of pH & volatile.acidity: -0.032"

The above plots show the relationship between pH and different forms of acidity. pH is clearly more strongly correlated with fixed acidity (-0.426) than with the other forms of acidity (citric acid, volatile acidity).

Correlation Coefficient of density vs. fixed.acidity, residual.sugar, total.sulfur.dioxide, alcohol:

## [1] "Correlation value of density & fixed.acidity: 0.265"
## [1] "Correlation value of density & residual.sugar: 0.839"
## [1] "Correlation value of density & total.sulfur.dioxide: 0.53"
## [1] "Correlation value of density & alcohol: -0.78"

Density has a very strong relationship with residual sugar (0.839) and a moderate one with total sulfur dioxide (0.53), followed by fixed acidity (0.265). As already mentioned, density and alcohol are strongly negatively correlated (-0.78): as one increases, the other decreases.

Correlation Coefficient of total.sulfur.dioxide vs. residual.sugar, chlorides, bound.sulfur.dioxide, free.sulfur.dioxide:

## [1] "Correlation value of total.sulfur.dioxide & residual.sugar: 0.401"
## [1] "Correlation value of total.sulfur.dioxide & chlorides: 0.199"
## [1] "Correlation value of total.sulfur.dioxide & bound.sulfur.dioxide: 0.922"
## [1] "Correlation value of total.sulfur.dioxide & free.sulfur.dioxide: 0.616"

As total sulfur dioxide is the sum of bound and free sulfur dioxide, the strong correlation between them is expected. Other than that, total.sulfur.dioxide is positively correlated with residual.sugar (0.401) and, more weakly, with chlorides (0.199).

Correlation Coefficient of citric.acid vs. fixed.acidity, volatile.acidity:

## [1] "Correlation value of fixed.acidity & citric.acid: 0.289"
## [1] "Correlation value of volatile.acidity & citric.acid: -0.149"

It is interesting to look at the relationships among the three different forms of acidity. citric.acid and fixed.acidity are weakly positively correlated, while citric.acid and volatile.acidity are weakly negatively correlated.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Quality vs. alcohol, density
    • As alcohol level increases, the quality improves.
    • Also, as alcohol level increases, the density tends to be lower.
  • Quality vs. acidity (fixed acidity, volatile acidity)
    • Good-quality wines seem to have relatively higher pH values, which means they are less acidic.
    • pH correlates more strongly with fixed acidity than with the other forms of acidity.
    • As volatile acidity increases, the quality seems worse.
  • Quality vs. sulfur dioxide
    • Good-quality wines seem to have higher free sulfur dioxide but lower bound sulfur dioxide.
  • Quality vs. chlorides
    • Good-quality wines seem to have lower chlorides level.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Density seems to have strong relationships with residual sugar and total sulfur dioxide.

What was the strongest relationship you found?

The strongest relationships are between total sulfur dioxide and bound sulfur dioxide (0.922), density and residual sugar (0.839), and density and alcohol (-0.78). pH and fixed acidity are also related, though more moderately (-0.426).

Multivariate Plots Section

Correlation Coefficient of chlorides & alcohol across quality:

## [1] "Quality 3 : -0.353"
## [1] "Quality 4 : -0.387"
## [1] "Quality 5 : -0.223"
## [1] "Quality 6 : -0.32"
## [1] "Quality 7 : -0.555"
## [1] "Quality 8 : -0.512"
## [1] "Quality 9 : -0.51"

Correlation Coefficient of chlorides & alcohol across quality.band:

## [1] "Quality Band bad : -0.371"
## [1] "Quality Band average : -0.351"
## [1] "Quality Band good : -0.516"

Alcohol and chlorides are two of the stronger features influencing the quality of white wines. The first plot above shows the distribution of chlorides and alcohol across the quality scores. The second plot filters out some outliers and the average-quality wines to show a clearer trend.
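Per-group correlations such as those listed above can be obtained by splitting the data on the grouping variable and applying `cor()` to each piece. In this sketch the helper name `cor_by_group` is hypothetical and the toy values are made up purely for illustration; they are not taken from the dataset.

```r
# Hypothetical helper: correlation of x and y within each level of `group`.
cor_by_group <- function(data, x, y, group) {
  sapply(split(data, data[[group]]),
         function(d) round(cor(d[[x]], d[[y]]), 3))
}

# Made-up toy data (NOT real observations), four wines per band.
toy <- data.frame(
  quality.band = rep(c("bad", "good"), each = 4),
  alcohol      = c(9, 10, 11, 12, 9, 10, 11, 12),
  chlorides    = c(0.06, 0.05, 0.05, 0.04, 0.05, 0.04, 0.03, 0.02)
)

res <- cor_by_group(toy, "alcohol", "chlorides", "quality.band")
print(res)
```

Groups with very few wines (quality 9, for example) give unstable coefficients, so such per-group values are best read alongside the group sizes.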

Correlation Coefficient of citric.acid & fixed.acidity across quality:

## [1] "Quality 3 : 0.337"
## [1] "Quality 4 : 0.515"
## [1] "Quality 5 : 0.294"
## [1] "Quality 6 : 0.281"
## [1] "Quality 7 : 0.266"
## [1] "Quality 8 : 0.186"
## [1] "Quality 9 : 0.55"

Correlation Coefficient of citric.acid & fixed.acidity across quality.band:

## [1] "Quality Band bad : 0.476"
## [1] "Quality Band average : 0.283"
## [1] "Quality Band good : 0.209"

The relationship between fixed.acidity and citric.acid becomes weaker as the quality improves.

From the second plot above, we can see that good wines tend to have lower fixed acidity; also the citric acid of good wines tends to have a smaller variance.

Correlation Coefficient of pH & fixed.acidity across quality:

## [1] "Quality 3 : -0.755"
## [1] "Quality 4 : -0.466"
## [1] "Quality 5 : -0.427"
## [1] "Quality 6 : -0.379"
## [1] "Quality 7 : -0.492"
## [1] "Quality 8 : -0.476"
## [1] "Quality 9 : -0.828"

Correlation Coefficient of pH & fixed.acidity across quality.band:

## [1] "Quality Band bad : -0.514"
## [1] "Quality Band average : -0.419"
## [1] "Quality Band good : -0.456"

The relationship between pH and fixed.acidity is consistently negative and moderate to strong across the different quality scores and bands.

Correlation Coefficient of total.sulfur.dioxide & bound.sulfur.dioxide across quality.band:

## [1] "Quality Band bad : 0.879"
## [1] "Quality Band average : 0.928"
## [1] "Quality Band good : 0.874"

The relationship between total.sulfur.dioxide and bound.sulfur.dioxide is consistently strong across the different quality bands.

Correlation Coefficient of total.sulfur.dioxide & free.sulfur.dioxide across quality.band:

## [1] "Quality Band bad : 0.708"
## [1] "Quality Band average : 0.609"
## [1] "Quality Band good : 0.616"

Similarly, the relationship between total.sulfur.dioxide and free.sulfur.dioxide is also consistently fairly strong across the different quality bands.

Correlation Coefficient of fixed.acidity & volatile.acidity across quality.band:

## [1] "Quality Band bad : -0.047"
## [1] "Quality Band average : -0.033"
## [1] "Quality Band good : -0.127"

The correlation between fixed and volatile acidity is very weak. But from the plots above, we can see that, compared to bad-quality wines, good-quality wines appear to have lower levels of fixed acidity and volatile acidity. Many more outliers in fixed and volatile acidity can be found among the bad-quality wines.

Correlation Coefficient of residual.sugar & density across quality.band:

## [1] "Quality Band bad : 0.741"
## [1] "Quality Band average : 0.848"
## [1] "Quality Band good : 0.821"

The correlation of density and residual sugar is consistently strong. As mentioned before, good wines tend to have smaller density, which means lower level of residual sugar.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the multivariate plots section, I explored the following features across quality and quality band: density vs. residual sugar, volatile acidity vs. fixed acidity, free.sulfur.dioxide vs. total.sulfur.dioxide, pH vs. fixed acidity, alcohol, chlorides, citric acid.

  • The features density and residual sugar seem to strengthen each other, as do total sulfur dioxide and free sulfur dioxide.
  • After filtering out the average wines, we can see a clearer contrast between good and bad wines. Here are some trends I found:
    • Good wines tend to have smaller variance in several features, such as total sulfur dioxide, free sulfur dioxide, citric acid and volatile acidity, which suggests that too high or too low an amount of these is not good for the quality of the wines.
    • Alcohol level is still the strongest feature influencing the quality of the wine. Higher alcohol levels tend to go with lower chlorides levels.
    • Good wines tend to have higher pH values, which goes with lower levels of fixed acidity.
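The "smaller variance" observation in the list above could be checked numerically by computing the variance of a feature within each quality band. The sketch below uses made-up toy values (not real observations) and an assumed data frame name `toy`.

```r
# Made-up toy data (NOT real observations), four wines per band.
toy <- data.frame(
  quality.band = rep(c("bad", "good"), each = 4),
  citric.acid  = c(0.10, 0.30, 0.50, 0.70, 0.30, 0.32, 0.34, 0.36)
)

# Variance of citric acid within each quality band.
v <- tapply(toy$citric.acid, toy$quality.band, var)
print(round(v, 4))
```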

Were there any interesting or surprising interactions between features?

It is interesting that citric acid seems to have some relationship with volatile acidity. They are different forms of acid in the wine, but they seem to weaken each other, though the relationship is not strong.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No


Final Plots and Summary

Plot One

Description One

This set of box plots illustrates the effect of alcohol level on white wine quality. Generally speaking, better-quality wines tend to have higher alcohol levels. However, wines with a quality score of 5 have lower alcohol levels than wines scoring 3 and 4.

Plot Two

Description Two

With the outliers removed and the average-quality wines filtered out, this plot clearly demonstrates the relationship between density and residual sugar. Across the different qualities, as the residual sugar level increases, the density increases. This plot also shows the trend that good-quality wines seem to have lower density and lower levels of residual sugar.

Plot Three

Description Three

This plot shows the correlation between pH and fixed acidity across bad and good wines. As the fixed acidity level increases, the pH value decreases. This makes sense because pH is a numeric scale for acidity: below 7, the smaller the pH value, the more acidic the solution. The plot also suggests that a pH that is too low (which probably means too much acidity) is not good for wine quality.


Reflection

Through this exploratory data analysis of the white wine dataset, I identified the key factors that influence the quality of the wines, including alcohol percentage, pH/acidity, density and chlorides.

At the beginning of the analysis, I struggled because the correlations between the variables in this dataset are generally weak, with only a few relatively stronger relationships. My approach was first to find the strongest features for quality, which are alcohol and density. I then looked for the strongest features for alcohol, which include residual.sugar, chlorides and different forms of sulfur dioxide. The same exploration was done for density, the second strongest feature for quality; its strongest features include residual.sugar and different forms of sulfur dioxide. In this way, I tried to find different levels of connection between the variables.

However, as the quality score is measured subjectively by wine experts, I believe the correlations among the factors mentioned above are within reasonable bounds. Further statistical study is suggested in order to confirm these hypotheses quantitatively.